Here we use the following links to data.
GADM_census_path <- "~/Desktop/OUCRU/FacebookData/ColocationMarc/GADM_census.rds"
coloc_mat_path <- "~/Desktop/OUCRU/FacebookData/ColocationMarc/coloc_mat.rds"
GADM_country_path <- "~/Desktop/OUCRU/FacebookData/ColocationMarc/gadm36_VNM_0_sf.rds"
colocation_path <- "~/Desktop/OUCRU/FacebookData/ColocationMarc/colocation2.rds"
coloc_line_path <- "~/Desktop/OUCRU/Project/meta_contact_model/dist_linestrings.RDS"
Change them accordingly if you want to run the script locally on your computer.
The needed packages:
library(sf)
library(stars)
library(tidyr)
library(magrittr)
library(purrr)
library(dplyr) # best to load last
library(reshape2)
library(data.table)
library(ggplot2)
library(lubridate)
setwd("~/Desktop/OUCRU/FacebookData/ColocationMarc")
dist_polygons <- readRDS(GADM_census_path)
coloc_mat <- readRDS(coloc_mat_path)
vn0 <- readRDS(GADM_country_path)
worldpop <- read_stars("VNM_ppp_v2b_2020_UNadj.tif")
colocation <- readRDS(colocation_path)
Redefining the hist() function:
hist2 <- function(...) hist(..., main = NA)
The following model uses colocation data to formulate the coupling in a SEIR metapopulation model. The equations that describe the epidynamics in each of the populations is given below: \[ \begin{aligned} \frac{dS_i}{dt} & = - S_i \sum_{j} \beta_{ij} \frac{I_j}{N_j} \\ \frac{dE_i}{dt} & = S_i \sum_{j} \beta_{ij} \frac{I_j}{N_j} - \sigma E_i \\ \frac{dI_i}{dt} & = \sigma E_i - \gamma I_i \\ \frac{dR_i}{dt} & = \gamma I_i \\ \end{aligned} \] where \(\frac{1}{\sigma} = 7\) days is the latency period and \(\frac{1}{\gamma} = 7\) is the recovery period. Note that \(N_i = S_i + E_i + I_i + R_i\). We assume that \(\beta_{ij}\) which represents the transmission rate from population \(j\) to population \(i\) is proportional to \(C_{ij}\), the colocation probability for population \(i\) and population \(j\).
The following parameterization is an adaptation of the contact model given in the paper, Modeling the impact of human mobility and travel restrictions on the potential spread of SARS-CoV-2 in Taiwan.
The \(\beta_{ij}\) are defined as follows: \[ \begin{aligned} \beta_{ij} = \gamma R_{0 ij} \end{aligned} \] where \[ \begin{aligned} R_{0 ij} & = R_{0 \: Hanoi'} \frac{C_{ij}}{C_{Hanoi'-Hanoi'}} \\ \end{aligned} \] and \(Hanoi'\) is the district in Hà Nội with the highest density. This assumes that the strength of transmission from district \(j\) to district \(i\) is dependent on colocation probabilities and independent of the relative population sizes of the interacting regions.
Let’s calculate the district in Hanoi with the highest density.
HanoiDist <- dist_polygons %>%
filter(province == "Hà Nội") %>%
arrange(desc(den_km2)) %>%
head(1) %>%
pull(district)
HanoiDistID <- dist_polygons %>%
filter(province == "Hà Nội") %>%
filter(district == HanoiDist) %>%
pull(polygon_id)
As such, we will set the \(R_0\) of Hoàn Kiếm to 18; that is, \(R_{0 \: Hanoi'} = 18\).
The following function computes the \(R_{0 ij}\) values with the standard reference district with \(R_0\) of standardR0 and district ID of standardDistID.
compute_R0 <- function(coloc_mat, standardR0, standardDistID) {
melt(coloc_mat) %>%
setNames(c("polygon1_id", "polygon2_id", "link_value")) %>%
mutate(R0 = standardR0 * link_value / coloc_mat[paste(standardDistID), paste(standardDistID)])
}
Let’s compute the \(R_{0 ij}\) with \(R_{0 \: Hanoi'} = 18\):
R0 <- compute_R0(coloc_mat, 18, HanoiDistID)
Intracity \(R_0\) is defined as \(R_ii\).
Let’s see the distribution of the intracity \(R_0\) values; that is, the distribution of \(R_{0 ii}\) values.
intracityR0 <- R0 %>%
filter(polygon1_id == polygon2_id) %>%
rename(intraR0 = R0)
hist2(intracityR0$intraR0, n = 50, xlab = "Intracity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 60, 10))
axis(2)
cb <- RColorBrewer::brewer.pal(9, "YlOrBr")
color_generator <- colorRampPalette(cb)
pal <- color_generator(10)
hist2(intracityR0$intraR0, quantile(intracityR0$intraR0, seq(0, 1, le = 11)), col = pal, axes = FALSE,
xlab = "Intracity R0", ylab = "density of probability")
axis(1, seq(0, 60, 10))
axis(2)
quantile(intracityR0$intraR0, seq(0, 1, le = 11))
## 0% 10% 20% 30% 40% 50% 60% 70%
## 0.000000 1.586063 2.008983 2.373354 2.887077 3.456204 4.453696 6.125704
## 80% 90% 100%
## 8.322006 13.101921 55.396777
Surprisingly, we see that there are several districts with intracity \(R_0\) values greater than 20. This is far higher than what you would expect for measles. Let’s see what cities in particular have these high intracity \(R_0\) values.
Let’s map the intracity \(R_0\) values for the districts:
intracityR0sf <- dist_polygons %>%
right_join(select(intracityR0, -polygon2_id), c("polygon_id" = "polygon1_id"))
The following district have intracity \(R0\) greater than 10.
vn0 %>%
st_geometry() %>%
plot(col = "grey")
intracityR0sf %>%
select("intraR0") %>%
filter(intraR0 > 10) %>%
plot(add = TRUE, col = "red")
The following districts have intracity \(R0\) greater than 20.
intracityR0sf %>%
filter(intraR0 > 20) %>%
arrange(desc(intraR0)) %>%
st_drop_geometry() %>%
select(province, district, intraR0)
## province district intraR0
## 1 Yên Bái Trạm Tấu 55.39678
## 2 Hà Giang Hoàng Su Phì 42.67776
## 3 Quảng Nam Tây Giang 42.55601
## 4 Quảng Ngãi Tây Trà 41.49006
## 5 Kon Tum Kon Plông 33.77775
## 6 Thanh Hóa Mường Lát 32.71860
## 7 Sơn La Sốp Cộp 32.20555
## 8 Quảng Nam Phước Sơn 31.46382
## 9 Hà Giang Xín Mần 31.07934
## 10 Hà Giang Đồng Văn 29.85194
## 11 Quảng Ngãi Lý Sơn 29.61061
## 12 Hà Giang Mèo Vạc 28.03566
## 13 Cao Bằng Bảo Lạc 27.90237
## 14 Quảng Nam Nam Trà My 26.41899
## 15 Lạng Sơn Đình Lập 26.37892
## 16 Cao Bằng Bảo Lâm 24.83981
## 17 Sơn La Bắc Yên 24.52896
## 18 Quảng Ninh Ba Chẽ 23.66076
## 19 Quảng Ninh Bình Liêu 23.55896
## 20 Điện Biên Tủa Chùa 23.15901
## 21 Cao Bằng Thông Nông 23.08779
## 22 Điện Biên Điện Biên Đông 22.44800
## 23 Bình Thuận Phú Quí 22.33890
## 24 Hồ Chí Minh 4 22.28844
## 25 Gia Lai Ayun Pa 22.03421
## 26 Hà Giang Quản Bạ 20.98567
## 27 Hồ Chí Minh 3 20.84098
## 28 Lào Cai Si Ma Cai 20.67039
## 29 Quảng Trị Quảng Trị 20.22414
vn0 %>%
st_geometry() %>%
plot(col = "grey")
intracityR0sf %>%
filter(intraR0 > 20) %>%
select(intraR0) %>%
plot(add = TRUE, col = "red")
It appears that remote regions of Vietnam have relatively higher intracity \(R_0\) values. Let’s verify this by graphing relationship between population density and intracity \(R_0\) values under the assumption that more remote regions will have lower population densities.
ggplot(intracityR0sf, aes(x = log10(den_km2), y = intraR0)) + geom_point()
Both populations with relatively low population densities and populations with relatively high population densities have high intracity \(R_0\) values. Populations with especially low densities have the greatest intracity \(R_0\) values. Note that we should see a positive relationship between population density and intracity \(R_0\) values. The intracity \(R_0\) values need to be adjusted to fix the artifact of how the colocation matrix is constructed.
Intercity \(R_0\) for location \(i\) is defined as \(\sum_{j \neq i} R_{0 ij}\).
Let’s consider the intercity R0 values:
intercityR0 <- R0 %>%
filter(polygon1_id != polygon2_id) %>%
rename(interR0 = R0)
totalInter <- intercityR0 %>%
group_by(polygon1_id) %>%
summarize(interR0 = sum(interR0)) %>%
arrange(desc(interR0))
## `summarise()` ungrouping output (override with `.groups` argument)
Let’s map the intracity \(R_0\) values for the districts: The following districts have intercityR0 values greater than 1:
intercityR0sf <- dist_polygons %>%
right_join(totalInter, c("polygon_id" = "polygon1_id"))
vn0 %>%
st_geometry() %>%
plot(col = "grey")
intercityR0sf %>%
select("interR0") %>%
filter(interR0 > 1) %>%
plot(add = TRUE, col = "red")
Let’s graph the relationship between population density and intercity \(R_0\) values:
ggplot(intercityR0sf, aes(x = log10(den_km2), y = interR0)) + geom_point()
As we would expect, population centers with higher population densities are better connected with surrounding regions resulting in higher intercity \(R_0\) values. Hence, the intercity \(R_0\) values under this model are consistent with epidemiological assumptions.
Let’s graph the relationship between intracity \(R_0\) values and intercity \(R_0\) values:
tmp <- intercityR0sf %>%
right_join(select(intracityR0, -polygon2_id, -link_value), c("polygon_id" = "polygon1_id"))
ggplot(tmp, aes(x = intraR0, y = interR0)) + geom_point() +
geom_abline(slope = 1, intercept = 0)
There are two distinct groups in the plot: one group has larger intercity \(R_0\) values relative to intracity \(R_0\) values, while the other has larger intracity \(R_0\) values relative to intercity \(R_0\) values.
Let’s use a linear mixture model to assign each point to a particular linear model:
library(flexmix)
## Loading required package: lattice
set.seed(10)
model <- flexmix(interR0 ~ intraR0, tmp, k = 2)
summary(model)
##
## Call:
## flexmix(formula = interR0 ~ intraR0, data = tmp, k = 2)
##
## prior size post>0 ratio
## Comp.1 0.78 574 643 0.893
## Comp.2 0.22 124 666 0.186
##
## 'log Lik.' -498.5177 (df=7)
## AIC: 1011.035 BIC: 1042.873
plot(tmp$interR0, tmp$intraR0, col = clusters(model),
xlab = "intercity R0", ylab = "intracity R0")
Let’s map the clustering observed in the linear mixture model to the districts:
vn0 %>%
st_geometry() %>%
plot(col = "grey")
colors <- clusters(model)
colors[colors == 1] = 3
tmp %>%
transmute(cluster = clusters(model)) %>%
st_geometry() %>%
plot(add = TRUE, col = colors)
It is a little noisy, so let’s take a closer look at the districts that are “clearly” in one group or the other; that is, districts with posterior probabilities that indicate clear membership to one of the linear regression models.
postProb <- posterior(model) %>%
as.data.frame() %>%
transmute(model1 = V1, model2 = V2, cluster = clusters(model)) %>%
mutate(m1Vm2 = model1/model2, m2Vm1 = model2/model1)
tmpPost <- tmp %>%
transmute(cluster = postProb$cluster, m1Vm2 = postProb$m1Vm2,
m2Vm1 = postProb$m2Vm1)
vn0 %>%
st_geometry() %>%
plot(col = "grey")
tmpPost %>%
filter(cluster == 1) %>%
arrange(desc(m1Vm2)) %>%
head(50) %>%
st_geometry() %>%
plot(add = TRUE, col = 3)
tmpPost %>%
filter(cluster == 2) %>%
arrange(desc(m2Vm1)) %>%
head(50) %>%
st_geometry() %>%
plot(add = TRUE, col = 2)
The districts with high intercity \(R_0\) relative to intracity \(R_0\) are clustered around Hanoi and Ho Chi Minh, while the districts with low intercity \(R_0\) relative to intracity \(R_0\) are clustered in remote regions.
The \(\beta_{ik}\) are defined as follows: \[ \begin{aligned} \beta_{ij} = \gamma R_{0 ij} \end{aligned} \] where \[ \begin{aligned} R_{0 ij} & = R_{0 \: Hanoi'} \frac{C_{ij} N_j}{C_{Hanoi'-Hanoi'} \cdot N_{Hanoi'}} \\ \end{aligned} \] and \(Hanoi'\) is the district in Hà Nội with the highest density. The numerator represents the average number of individuals from population \(j\) that colocate with an individual from population \(i\) per unit time, whereas the denominator is the average number of colocations among individuals in the densest district in Hanoi per unit time. This formulation scales the colocation probability by the population size of the interacting district.
The following function computes the \(R_{0 ij}\) values with the standard reference district with \(R_0\) of standardR0 and district ID of standardDistID.
compute_R0 <- function(dist_data, coloc_mat, standardR0, standardDistID) {
popHanoiDist <- dist_data %>%
filter(polygon_id == HanoiDistID) %>%
pull(n)
coloc_mat <- reshape2::melt(coloc_mat) %>%
setNames(c("polygon1_id", "polygon2_id", "link_value")) %>%
dplyr::right_join(select(as.data.frame(dist_data), polygon_id, n),
c("polygon2_id" = "polygon_id")) %>%
dplyr::mutate(R0 = standardR0 * link_value * n /
(coloc_mat[paste(standardDistID), paste(standardDistID)] * popHanoiDist))
}
Let’s compute the \(R_{0 ij}\) with \(R_{0 \: Hanoi'} = 18\):
R0 <- compute_R0(dist_polygons, coloc_mat, 18, HanoiDistID)
Intracity \(R_0\) is defined as \(R_ii\).
Let’s see the distribution of the intracity \(R_0\) values; that is, the distribution of \(R_{0 ii}\) values.
intracityR0 <- R0 %>%
filter(polygon1_id == polygon2_id) %>%
rename(intraR0 = R0)
hist2(intracityR0$intraR0, n = 50, xlab = "Intracity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 20, 2))
axis(2)
cb <- RColorBrewer::brewer.pal(9, "YlOrBr")
color_generator <- colorRampPalette(cb)
pal <- color_generator(10)
hist2(intracityR0$intraR0, quantile(intracityR0$intraR0, seq(0, 1, le = 11)), col = pal, axes = FALSE,
xlab = "Intracity R0", ylab = "density of probability")
axis(1, seq(0, 20, 2))
axis(2)
quantile(intracityR0$intraR0, seq(0, 1, le = 11))
## 0% 10% 20% 30% 40% 50% 60% 70%
## 0.000000 1.007264 1.242158 1.464806 1.650588 1.908886 2.221608 2.554752
## 80% 90% 100%
## 3.258339 4.705081 18.142725
The intracity \(R_0\) values are far more reasonable for measles.
Let’s map the intracity \(R_0\) values for the districts:
intracityR0sf <- dist_polygons %>%
right_join(select(intracityR0, -polygon2_id, -n), c("polygon_id" = "polygon1_id"))
The following district have intracity \(R0\) greater than 5.
vn0 %>%
st_geometry() %>%
plot(col = "grey")
intracityR0sf %>%
select("intraR0") %>%
filter(intraR0 > 5) %>%
plot(add = TRUE, col = "red")
While there is slightly more clustering around the population centers of Hanoi and Ho Chi Minh, there is still high intracity \(R_0\) values in relatively remote regions.
Let’s explore the relationship between population density and intracity \(R_0\) values under the assumption that more remote regions will have lower population densities.
ggplot(intracityR0sf, aes(x = log10(den_km2), y = intraR0)) + geom_point()
This is a large improvement from the previous model, but there are still districts with low densities that have high intracity \(R_0\) values. One would assume that remote regions with low densities would have smaller within-district colocation probabilities because people are so sparsely located and rarely interact. This property, however, may be due to the fact that the majority of the population within these small districts reside only a small area of the much larger district; therefore, although the district has a low population density overall, the areas where residents actually reside may be quite dense. This could result in a high within-district colocation probability and consequently a high intracity \(R_0\). This hypothesis will be tested below.
Intercity \(R_0\) for location \(i\) is defined as \(\sum_{j \neq i} R_{0 ij}\).
Let’s consider the intercity R0 values:
intercityR0 <- R0 %>%
filter(polygon1_id != polygon2_id) %>%
rename(interR0 = R0)
totalInter <- intercityR0 %>%
group_by(polygon1_id) %>%
summarize(interR0 = sum(interR0)) %>%
arrange(desc(interR0))
## `summarise()` ungrouping output (override with `.groups` argument)
Let’s map the intracity \(R_0\) values for the districts: The following districts have intercityR0 values greater than 1:
intercityR0sf <- dist_polygons %>%
right_join(totalInter, c("polygon_id" = "polygon1_id"))
vn0 %>%
st_geometry() %>%
plot(col = "grey")
intercityR0sf %>%
select("interR0") %>%
filter(interR0 > 1) %>%
plot(add = TRUE, col = "red")
The districts with high intercity \(R_0\) values are clustered in high density/highly connected regions including Hanoi and Ho Chi Minh, which is what we would expect.
Let’s graph the relationship between population density and intercity \(R_0\) values:
ggplot(intercityR0sf, aes(x = log10(den_km2), y = interR0)) + geom_point()
As we would expect, population centers with higher population densities are better connected with surrounding regions resulting in higher intercity \(R_0\) values. Hence, the intercity \(R_0\) values under this model are consistent with epidemiological assumptions. Implicit in this analysis is the assumption that districts with higher population densities are better connected to surrounding regions. In general, this assumption holds because districts which have high population densities tend to have advanced transport networks that facilitate travel between other districts.
Let’s graph the relationship between intracity \(R_0\) values and intercity \(R_0\) values:
tmp <- intercityR0sf %>%
right_join(select(intracityR0, -polygon2_id, -link_value), c("polygon_id" = "polygon1_id"))
ggplot(tmp, aes(x = intraR0, y = interR0)) + geom_point() +
geom_abline(slope = 1, intercept = 0)
The grouping that we observed in the first model has largely disappeared. There are a few disticts with high intracity \(R_0\) values in comparison to intercity \(R_0\) which may be a point of concern. As we would expect, however, the intracity \(R_0\) values are weakly proportional to the intercity \(R_0\) values.
Let’s define the risk of infection for a location \(i\) as follows: \[ \begin{aligned} \text{Risk of infection for location i} = \frac{\sum_{j \: includes \: i } R_{0 ij}}{\text{max}_l(\sum_{j \: includes \: l} R_{0 lj})} \end{aligned} \] The sum of intracity \(R_0\) and intercity \(R_0\) reflects total risk of infection and was standardized to the highest sum of intracity \(R_0\) and intercity \(R_0\).
Let’s calculate it for all districts:
ROIdf <- R0 %>%
group_by(polygon1_id) %>%
summarize(ROI = sum(R0)) %>%
rename(polygon_id = polygon1_id) %>%
arrange(desc(ROI)) %>%
mutate(ROI = ROI / head(., 1)$ROI)
## `summarise()` ungrouping output (override with `.groups` argument)
ROIsf <- dist_polygons %>%
right_join(ROIdf, "polygon_id")
It look like this:
ROIsf %>%
arrange(desc(ROI)) %>%
select(province, district, den_km2, ROI) %>%
st_drop_geometry() %>%
head(30)
## province district den_km2 ROI
## 1 Hồ Chí Minh 3 40590.0774 1.0000000
## 2 Hồ Chí Minh 10 41979.8331 0.9462927
## 3 Hà Nội Hoàn Kiếm 44315.7927 0.9253216
## 4 Hồ Chí Minh 5 41120.6293 0.8626589
## 5 Hồ Chí Minh 11 41427.8192 0.8570843
## 6 Hồ Chí Minh 4 41772.4692 0.8570194
## 7 Hồ Chí Minh Phú Nhuận 37221.5664 0.8359342
## 8 Hồ Chí Minh 1 20906.4535 0.7298933
## 9 Hồ Chí Minh Bình Thạnh 24932.7512 0.7215633
## 10 Hồ Chí Minh 6 36560.3898 0.7173413
## 11 Hồ Chí Minh Tân Phú 32256.2361 0.7099070
## 12 Hà Nội Hai Bà Trưng 32389.5908 0.6873188
## 13 Hà Nội Đống Đa 36694.1023 0.6848160
## 14 Hồ Chí Minh Gò Vấp 33762.4512 0.6666487
## 15 Hồ Chí Minh Tân Bình 17625.4264 0.6195264
## 16 Hà Nội Thanh Xuân 30377.4820 0.5958250
## 17 Hà Nội Ba Đình 26681.5236 0.5931674
## 18 Hồ Chí Minh 8 21146.4587 0.5394637
## 19 Hà Giang Hoàng Su Phì 104.9743 0.4910995
## 20 Đà Nẵng Thanh Khê 21657.0815 0.4828190
## 21 Hà Nội Cầu Giấy 19577.0573 0.4653514
## 22 Hồ Chí Minh 7 9740.5280 0.4205578
## 23 Hà Giang Đồng Văn 174.0675 0.4133473
## 24 Hải Phòng Lê Chân 18121.3750 0.4050484
## 25 Hà Giang Mèo Vạc 147.2357 0.4035337
## 26 Hà Nội Hoàng Mai 9864.9833 0.3924667
## 27 Đà Nẵng Hải Châu 9491.6448 0.3900344
## 28 Hồ Chí Minh Bình Tân 11968.4052 0.3760759
## 29 Hà Giang Xín Mần 114.7977 0.3608692
## 30 Hà Nội Tây Hồ 6257.0554 0.3475357
The districts with the highest risk of infection are located in Hanoi and Ho Chi Minh provinces. There are a few districts with very low population densities (Ha Giang Province in Northern Vietnam) that have relatively high risks of infection, which are presumably attributable to the fact that they are concentrated within a small region of the much large district and do not travel outside the population leading to very high \(C_{ii}\) values in the colocation matrix. This is tested below.
hist2(ROIsf$ROI, n = 50, xlab = "Intracity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 1, 0.1))
axis(2)
Let’s map the risk of infection to the districts:
ROIsf %>%
st_geometry() %>%
plot(lwd = .1, col = pal[cut(ROIsf$ROI, quantile(ROIsf$ROI, seq(0, 1, le = 11)))], main = NA)
vn0 %>%
st_geometry() %>%
plot(add = TRUE)
Let’s examine the relationship between population density and risk of infection:
ggplot(ROIsf, aes(x = log10(den_km2), y = ROI)) +
geom_point() +
geom_point(data = filter(ROIsf, province == "Hà Giang"), aes(x= log10(den_km2), y = ROI), color='red')
Generally, risk of infection increases with population density. We see, however, that some populations with low densities have high risks of infection. In particular, the districts in the Hà Giang province, which have particularly high ROI for the relatively low population densities are highlighted red.
Let’s break down the risk of infection into risk of infection due to intracity transmission and risk of infection due to intercity transmission for each district.
tmp <- totalInter %>%
right_join(select(intracityR0, polygon1_id, intraR0, n), "polygon1_id") %>%
rename(polygon_id = polygon1_id) %>%
mutate(interROI = interR0 / (interR0 + intraR0)) %>%
filter(interROI != "NaN")
interROI <- dist_polygons %>%
right_join(select(tmp, polygon_id, interR0, intraR0, interROI), "polygon_id")
It look like this:
interROI %>%
arrange(desc(interROI)) %>%
select(polygon_id, province, district, den_km2, interROI) %>%
st_drop_geometry() %>%
head(30)
## polygon_id province district den_km2 interROI
## 1 834334 Bà Rịa - Vũng Tàu Côn Đảo 81.96605 1.0000000
## 2 834482 Kiên Giang Kiên Hải 739.64350 1.0000000
## 3 834281 Quảng Nam Hội An 1561.87028 1.0000000
## 4 834741 Quảng Ninh Cô Tô 122.76955 1.0000000
## 5 834792 Quảng Ninh Vân Đồn 87.10470 1.0000000
## 6 834869 Long An Tân Trụ 662.14180 0.6387840
## 7 834346 Tiền Giang Tân Phước 193.30883 0.5634492
## 8 834400 Quảng Ngãi Sơn Tịnh 433.67214 0.5612937
## 9 834248 Thanh Hóa Đông Sơn 965.41999 0.5547872
## 10 834706 Hồ Chí Minh Nhà Bè 1190.04327 0.5365434
## 11 834876 Tiền Giang Chợ Gạo 855.24199 0.5346049
## 12 834487 Cần Thơ Cái Răng 1411.69547 0.5070858
## 13 834887 Nam Định Mỹ Lộc 1036.96940 0.5060699
## 14 834343 Long An Châu Thành 740.58097 0.4913874
## 15 834710 Long An Cần Giuộc 918.48350 0.4901447
## 16 834447 Vĩnh Long Long Hồ 812.15741 0.4859359
## 17 834401 Quảng Ngãi Tư Nghĩa 680.44044 0.4826807
## 18 834495 Hậu Giang Châu Thành 611.27480 0.4796601
## 19 834883 Hồ Chí Minh Bình Chánh 2011.09816 0.4789702
## 20 834349 Tiền Giang Gò Công Tây 718.19836 0.4559759
## 21 834279 Đà Nẵng Hòa Vang 186.92844 0.4558046
## 22 834820 Hải Phòng Dương Kinh 1219.97605 0.4547509
## 23 834448 Vĩnh Long Mang Thít 639.71911 0.4541046
## 24 834490 Cần Thơ Phong Điền 835.00028 0.4478597
## 25 834296 Quảng Nam Phú Ninh 347.33045 0.4450557
## 26 834277 Đà Nẵng Ngũ Hành Sơn 2029.21969 0.4428092
## 27 834863 Bạc Liêu Vĩnh Lợi 429.29424 0.4411020
## 28 834691 Hồ Chí Minh 1 20906.45354 0.4408914
## 29 834379 Nghệ An Hưng Nguyên 764.91638 0.4406954
## 30 834450 Vĩnh Long Tam Bình 558.08908 0.4405246
Let’s map the percent ROI attributable to intercity transmission:
interROI %>%
st_geometry() %>%
plot(lwd = .1, col = pal[cut(interROI$interROI, quantile(interROI$interROI, seq(0, 1, le = 11)))], main = NA)
vn0 %>%
st_geometry() %>%
plot(add = TRUE)
Many of the districts with high ROI actually have relatively low ROI attributable to intercity transmission. This suggests that such the populations of such districts are isolated and a concentrated within a small region of the much larger district. Therefore, if a infected individual were to be introduced into a remote population with a high ROI (a rare event), the disease transmission within the district will be high because the individuals in the community are strongly connected. This will rarely happen, however, given how weak such a districts intercity interactions are with other populations.
The Hà Giang Province has a number of districts with high intracity \(R_0\) values in spite of their relatively low population densities; in fact, their population densities place them in the first quartile among all districts in Vietnam. One would assume that remote regions with low densities would have smaller within-district colocation probabilities because people are so sparsely located and rarely interact. The counter-intuitive property displayed by some districts in the Hà Giang Province may be attributable to the fact that the majority of the population within these districts reside only a small area of the much larger district; therefore, although the district has a low population density overall, the areas where residents actually reside may be quite dense. This could result in a high within-district colocation probability and consequently a high intracity \(R_0\).
There are 10 districts in the Hà Giang province, which is identified below:
plot(st_geometry(vn0), col = "grey")
dist_polygons %>%
filter(province == "Hà Giang") %>%
st_geometry() %>%
plot(add = TRUE, col = "yellow")
In order to characterize how evenly the population is spread in the district, the actual population density of each Bing pixel is compared to the population density of each Bing pixel if the population was perfectly spread out over all the pixels in the district. This is captured by the shaded area in the following graph, where a smaller area corresponds to a more spread out population.
The following function computes the area-under-curve (AUC) statistic for the district with polygon_ID according to the worldpop data set.
compute_AUC <- function(polygon_ID, polygons = dist_polygons, rstr = worldpop) {
dist <- polygons %>%
filter(polygon_id == polygon_ID)
tmp <- worldpop %>%
st_crop(dist) %>%
st_as_stars() %>%
unlist() %>%
na.omit() %>%
as.vector() %>%
sort(decreasing = TRUE)
cumsum <- (tmp / sum(tmp)) %>% cumsum()
(cumsum / length(cumsum)) %>% sum() - 0.5
}
Let’s compute the area-under-curve (AUC) statistic for each of the districts:
dist_AUCs <- unlist(purrr::map(pull(dist_polygons, polygon_id), compute_AUC)) %>%
data.frame(polygon_id = pull(dist_polygons, polygon_id), AUC = .)
Let’s check whether there are any districts with no population according to WorldPop:
no_pop <- dist_AUCs %>%
filter(AUC == -0.5) %>%
pull(polygon_id)
dist_polygons %>%
filter(polygon_id %in% no_pop) %>%
select(polygon_id, province, district) %>%
st_drop_geometry()
## polygon_id province district
## 1 834334 Bà Rịa - Vũng Tàu Côn Đảo
## 2 834302 Bình Thuận Phú Quí
## 3 850002 Hải Phòng Bạch Long Vĩ
## 4 850004 Quảng Trị Cồn Cỏ
Let’s map these districts without population data:
vn0 %>%
st_geometry() %>%
plot(col = "grey")
dist_polygons %>%
filter(polygon_id %in% no_pop) %>%
st_centroid() %>%
st_geometry() %>%
plot(pch = 3, col = 'red', add = TRUE)
All the districts without population data are islands off the coast of Vietnam.
Let’s remove these data points from the AUC data set.
dist_AUCs %<>%
filter(AUC != -0.5)
Let’s examine the distribution of the AUC statistics:
hist2(dist_AUCs$AUC, n = 50, axes = FALSE,
xlab = "area-under-curve (AUC)", ylab = "number of districts")
axis(1, seq(0, 0.4, 0.05))
axis(2, seq(0, 50, 10))
tmp <- ROIsf %>%
right_join(dist_AUCs, "polygon_id")
ggplot(tmp, aes(x = log10(den_km2), y = ROI)) + geom_point(aes(color = AUC))
Those districts with high ROI (above 0.25) and low density (below 300 / km2) do not necessarily have low AUC values. As such, the hypothesis that this property is attributable to the fact that the the majority of the population is concentrated in a small area relative to the district as whole may not hold. An alternative explanation may need to be sought.
Let’s map the AUC distribution:
dist_polygons %>%
right_join(dist_AUCs, "polygon_id") %>%
st_geometry() %>%
plot(lwd = .1, col = pal[cut(dist_AUCs$AUC, quantile(dist_AUCs$AUC, seq(0, 1, le = 11)))], main = NA)
vn0 %>%
st_geometry() %>%
plot(add = TRUE)
The darker colored districts have higher AUC values; that is, districts whose populations are less evenly spread out will be colored a darker.
In the above model, the census population is used rather than the facebook population even though the colocation probabilities are based on calculations using the facebook population. As such, the artifacts in the data may be attributable to the use of the census population which may be inconsistent with the facebook population.
Let’s compare the facebook population data with the census population data. The following function combines the facebook data with the GADM / census data for each week:
dist_fb_populations <- function(x) {
x %>%
select(polygon1_id, fb_population_1) %>%
distinct() %>%
right_join(select(st_drop_geometry(dist_polygons), -area_km2, -den_km2), c("polygon1_id" = "polygon_id"))
}
Let’s compute it:
tmp <- map(colocation, dist_fb_populations)
Let’s consider the change in facebook population for each district over time. Note that each black line represents the facebook population of a district over time.
xs <- ymd(names(colocation)) - 6
plot(xs, seq_along(xs), ylim = c(0, 5), type = "n")
tmp %>%
map_dfc(pull, fb_population_1) %>%
t() %>%
as.data.frame() %>%
map(log10) %>%
walk(lines, x = xs, col = adjustcolor("black", .25))
The facebook population of each district does not seem to change as a function of time.
And the distribution accross district looks like this:
tmp %>%
map(mutate, prop = fb_population_1 / n) %>%
first() %>%
pull(prop) %>%
hist2(50, xlab = "facebook coverage", ylab = "number of districts")
Let’s look at the facebook coverage as a function of the population size for the first week of the colocation data:
fb_pop <- colocation$`2020-03-03` %>%
select(polygon1_id, fb_population_1) %>%
distinct() %>%
right_join(select(st_drop_geometry(dist_polygons), -area_km2, -den_km2), c("polygon1_id" = "polygon_id"))
model <- lm(log10(fb_population_1) ~ log10(n), fb_pop)
summary(model)
##
## Call:
## lm(formula = log10(fb_population_1) ~ log10(n), data = fb_pop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.03536 -0.19070 -0.01065 0.18854 1.33225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.55716 0.18196 -25.05 <2e-16 ***
## log10(n) 1.48787 0.03592 41.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2796 on 690 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.7132, Adjusted R-squared: 0.7128
## F-statistic: 1716 on 1 and 690 DF, p-value: < 2.2e-16
fb_pop %>%
mutate(color = ifelse(province %in% c("Hà Giang"), "red", "blue")) %>%
with(plot(log10(n), log10(fb_population_1), col = color, axes = FALSE,
ylim = c(1, 4.5), xlim = c(3.8, 6),
xlab = "district population", ylab = "facebook population"))
abline(model, col = "green")
axis(1, 4:6, format(10000 * c(1, 10, 100), big.mark = ",", scientific = FALSE, trim = TRUE))
axis(2, 1:4, format(10 * c(1, 10, 100, 1000), big.mark = ",", scientific = FALSE, trim = TRUE))
Meaning that the facebook coverage increases with population size. Note that the points corresponding to the Hà Giang province are highlighted red.
Let’s explore the relationship between the residuals of the above linear regression with the ROI/density plot derived from the above metapopulation model:
no_data_dist <- fb_pop[is.na(fb_pop$fb_population_1), ]$polygon1_id
`%notin%` <- Negate(`%in%`)
tmp <- fb_pop %>%
filter(polygon1_id %notin% no_data_dist) %>%
mutate(residual = model$residuals)
The distribution of the residual values is given by:
tmp %>%
pull(residual) %>%
hist2(50, xlab = "residual", ylab = "number of districts")
ROIsf %>%
right_join(select(tmp, polygon1_id, residual), c("polygon_id" = "polygon1_id")) %>%
ggplot(aes(x = log10(den_km2), y = ROI)) + geom_point(aes(color = residual)) +
scale_color_continuous(low="yellow", high="blue")
The darker points represent districts with a higher, more positive residual, whereas the lighter points represent districts with a lower, more negative residual. A higher residual means that the facebook population is greater than what would be predicted from the census population, whereas a lower residual means that the facebook population is smaller than what would be predicted from the census population. We see that districts with low density but high ROI are generally characterized by more negative residuals. This is somewhat expected because a smaller facebook population relative to the census population means that a small fraction of individuals largely determine the colocation probability. If this small fraction of individuals are disproportionately connected relative to the district population as a whole, multiplying the colocation probability by the entire district population as given by the census data would result in a larger than expected ROI relative to population density.
Connecitivity web of colocation probabilities among districts.
Let’s create an sf object that contains linestrings between all possible pairs of districts in Vietnam:
all_links <- melt(coloc_mat) %>%
setNames(c("polygon1_id", "polygon2_id", "link_value")) %>%
filter(polygon1_id != polygon2_id) %>%
as.data.frame() %>%
right_join(select(st_centroid(dist_polygons), polygon_id, geometry), c("polygon1_id" = "polygon_id")) %>%
rename(geometry1 = geometry) %>%
right_join(select(st_centroid(dist_polygons), polygon_id, geometry), c("polygon2_id" = "polygon_id")) %>%
rename(geometry2 = geometry)
dist_linestrings <- st_sfc(mapply(function(a,b){st_cast(st_union(a,b),"LINESTRING")}, all_links$geometry1, all_links$geometry2, SIMPLIFY=FALSE)) %>%
as.data.frame() %>%
mutate(link_value = all_links$link_value) %>%
st_as_sf()
dist_linestrings <- readRDS(coloc_line_path)
Let’s map these connections as a web:
vn0 %>%
st_geometry() %>%
plot(col = "grey")
dist_linestrings %>%
plot(lwd = 0.05, col = pal[cut(all_links$link_value, quantile(all_links$link_value, seq(0, 1, le = 11)))], add = TRUE)
The darker the link color, the stronger the link.
As a reference, the distribution of the link values between districts is given below:
hist2(log2(all_links$link_value), n = 50,
xlab = "log2(link value)", ylab = "number of districts")
Let’s consider only the strongest links:
vn0 %>%
st_geometry() %>%
plot(col = "grey")
dist_linestrings %>%
arrange(link_value) %>%
tail(30000) %>%
plot(lwd = 0.15, col = pal[cut(.$link_value, quantile(.$link_value, seq(0, 1, le = 11)))], add = TRUE)
The darker the link color, the stronger the link. Note the clustering of dark regions.
Let’s see whether the artifact disappears when we use facebook population data rather than census population data:
Let’s check whether there are any districts without facebook population data:
fb_pop[is.na(fb_pop$fb_population_1), ]
## # A tibble: 6 x 7
## polygon1_id fb_population_1 province district n lon lat
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 850001 NA Bình Phước Phú Riềng 97931 107. 11.7
## 2 834775 NA Điện Biên Điện Biên Đông 59550 103. 21.2
## 3 850002 NA Hải Phòng Bạch Long Vĩ 912 108. 20.1
## 4 850003 NA Kon Tum Ia H' Drai 6638 108. 14.2
## 5 850004 NA Quảng Trị Cồn Cỏ 400 107. 17.2
## 6 850005 NA Sơn La Vân Hồ 61197 105. 20.8
no_data_dist <- fb_pop[is.na(fb_pop$fb_population_1), ]$polygon1_id
Let’s map these points:
vn0 %>%
st_geometry() %>%
plot(col = "grey")
dist_polygons %>%
filter(polygon_id %in% no_data_dist) %>%
st_centroid() %>%
st_geometry() %>%
plot(pch = 3, col = 'red', add = TRUE)
Let’s estimate the facebook population of these districts based on the linear regression given above:
data <- data.frame(n = fb_pop[is.na(fb_pop$fb_population_1), ]$n)
fb_pop[is.na(fb_pop$fb_population_1), ]$fb_population_1 <- 10^predict(model, newdata = data)
Let’s verify whether these estimates make sense:
fb_pop %>%
mutate(color = ifelse(polygon1_id %in% no_data_dist, "red", "blue")) %>%
with(plot(log10(n), log10(fb_population_1), col = color, axes = FALSE,
ylim = c(1, 4.5), xlim = c(3.8, 6),
xlab = "district population", ylab = "facebook population"))
axis(1, 4:6, format(10000 * c(1, 10, 100), big.mark = ",", scientific = FALSE, trim = TRUE))
axis(2, 1:4, format(10 * c(1, 10, 100, 1000), big.mark = ",", scientific = FALSE, trim = TRUE))
The estimated points (colored red) are consistent with the actual population data.
The following function computes the \(R_0\) under the facebook population adjusted model:
compute_R0 <- function(dist_data, coloc_mat, standardR0, standardDistID) {
popHanoiDist <- dist_data %>%
filter(polygon1_id == HanoiDistID) %>%
pull(fb_population_1)
coloc_mat <- reshape2::melt(coloc_mat) %>%
setNames(c("polygon1_id", "polygon2_id", "link_value")) %>%
dplyr::right_join(select(as.data.frame(dist_data), polygon1_id, fb_population_1),
c("polygon2_id" = "polygon1_id")) %>%
dplyr::mutate(R0 = standardR0 * link_value * fb_population_1 /
(coloc_mat[paste(standardDistID), paste(standardDistID)] * popHanoiDist))
}
Let’s compute the \(R_{0 ij}\) with \(R_{0 \: Hanoi'} = 18\):
R0 <- compute_R0(fb_pop, coloc_mat, 18, HanoiDistID)
Intracity \(R_0\) is defined as \(R_ii\).
Let’s see the distribution of the intracity \(R_0\) values; that is, the distribution of \(R_{0 ii}\) values.
intracityR0 <- R0 %>%
filter(polygon1_id == polygon2_id) %>%
rename(intraR0 = R0)
hist2(intracityR0$intraR0, n = 50, xlab = "Intracity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 30, 2))
axis(2)
cb <- RColorBrewer::brewer.pal(9, "YlOrBr")
color_generator <- colorRampPalette(cb)
pal <- color_generator(10)
hist2(intracityR0$intraR0, quantile(intracityR0$intraR0, seq(0, 1, le = 11)), col = pal, axes = FALSE,
xlab = "Intracity R0", ylab = "density of probability")
axis(1, seq(0, 30, 2))
axis(2)
quantile(intracityR0$intraR0, seq(0, 1, le = 11))
## 0% 10% 20% 30% 40% 50% 60%
## 0.0000000 0.4798879 0.6141574 0.7499863 0.8732814 1.0192806 1.2171663
## 70% 80% 90% 100%
## 1.6096531 2.7136535 4.7678751 32.4549638
Let’s map the intracity \(R_0\) values for the districts:
intracityR0sf <- dist_polygons %>%
right_join(select(intracityR0, -polygon2_id, -fb_population_1), c("polygon_id" = "polygon1_id"))
The following district have intracity \(R0\) greater than 10.
vn0 %>%
st_geometry() %>%
plot(col = "grey")
intracityR0sf %>%
select("intraR0") %>%
filter(intraR0 > 10) %>%
plot(add = TRUE, col = "red")
We see that there are two clusters: one in Hanoi Province and one in Ho Chi Minh Province. This is a significant improvement to the previous model, which considered census population data.
Let’s explore the relationship between population density and intracity \(R_0\) values under the assumption that more remote regions will have lower population densities.
ggplot(intracityR0sf, aes(x = log10(den_km2), y = intraR0)) + geom_point()
There are no longer districts with low densities that have high intracity R0 values as observed in previous formulations of the model.
Intercity \(R_0\) for location \(i\) is defined as \(\sum_{j \neq i} R_{0 ij}\).
Let’s consider the intercity R0 values:
intercityR0 <- R0 %>%
filter(polygon1_id != polygon2_id) %>%
rename(interR0 = R0)
totalInter <- intercityR0 %>%
group_by(polygon1_id) %>%
summarize(interR0 = sum(interR0)) %>%
arrange(desc(interR0))
## `summarise()` ungrouping output (override with `.groups` argument)
Let’s see the distribution of the intracity \(R_0\) values; that is, the distribution of \(R_{0 ii}\) values.
hist2(totalInter$interR0, n = 50, xlab = "Intercity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 30, 2))
axis(2)
Let’s map the intracity \(R_0\) values for the districts: The following districts have intercityR0 values greater than 1:
intercityR0sf <- dist_polygons %>%
right_join(totalInter, c("polygon_id" = "polygon1_id"))
vn0 %>%
st_geometry() %>%
plot(col = "grey")
intercityR0sf %>%
select("interR0") %>%
filter(interR0 > 1) %>%
plot(add = TRUE, col = "yellow")
The districts with high intercity \(R_0\) values are clustered in high density/highly connected regions including Hanoi and Ho Chi Minh, which is what we would expect.
Let’s graph the relationship between population density and intercity \(R_0\) values:
ggplot(intercityR0sf, aes(x = log10(den_km2), y = interR0)) + geom_point()
As we would expect, population centers with higher population densities are better connected with surrounding regions resulting in higher intercity \(R_0\) values. Hence, the intercity \(R_0\) values under this model are consistent with epidemiological assumptions.
Let’s graph the relationship between intracity \(R_0\) values and intercity \(R_0\) values:
tmp <- intercityR0sf %>%
right_join(select(intracityR0, -polygon2_id, -link_value), c("polygon_id" = "polygon1_id"))
ggplot(tmp, aes(x = intraR0, y = interR0)) + geom_point() +
geom_abline(slope = 1, intercept = 0)
The grouping that we observed in the first and second model has largely disappeared. As we would expect, the intracity \(R_0\) values are weakly proportional to the intercity \(R_0\) values.
Let’s calculate it for all districts:
distROIdf <- R0 %>%
group_by(polygon1_id) %>%
summarize(ROI = sum(R0)) %>%
rename(polygon_id = polygon1_id) %>%
arrange(desc(ROI)) %>%
mutate(ROI = ROI / head(., 1)$ROI)
## `summarise()` ungrouping output (override with `.groups` argument)
distROIsf <- dist_polygons %>%
right_join(distROIdf, "polygon_id")
It look like this:
distROIsf %>%
arrange(desc(ROI)) %>%
select(province, district, den_km2, ROI) %>%
st_drop_geometry() %>%
head(30)
## province district den_km2 ROI
## 1 Hồ Chí Minh 10 41979.833 1.0000000
## 2 Hồ Chí Minh 3 40590.077 0.9656127
## 3 Hồ Chí Minh 5 41120.629 0.9248474
## 4 Hồ Chí Minh 4 41772.469 0.9221712
## 5 Hồ Chí Minh Phú Nhuận 37221.566 0.8867105
## 6 Hồ Chí Minh 11 41427.819 0.8793910
## 7 Hồ Chí Minh Tân Bình 17625.426 0.8404917
## 8 Hồ Chí Minh 1 20906.454 0.8236542
## 9 Hồ Chí Minh Tân Phú 32256.236 0.7981342
## 10 Hồ Chí Minh Bình Thạnh 24932.751 0.7702293
## 11 Hồ Chí Minh 6 36560.390 0.7503230
## 12 Hà Nội Đống Đa 36694.102 0.7116461
## 13 Hồ Chí Minh Gò Vấp 33762.451 0.7067554
## 14 Hà Nội Hai Bà Trưng 32389.591 0.6940107
## 15 Hà Nội Thanh Xuân 30377.482 0.6708553
## 16 Hồ Chí Minh 8 21146.459 0.6351546
## 17 Hồ Chí Minh Bình Tân 11968.405 0.6176974
## 18 Hà Nội Hoàn Kiếm 44315.793 0.6111839
## 19 Hà Nội Cầu Giấy 19577.057 0.6002443
## 20 Hồ Chí Minh 7 9740.528 0.5880481
## 21 Hà Nội Ba Đình 26681.524 0.5679395
## 22 Hà Nội Hoàng Mai 9864.983 0.5333316
## 23 Hồ Chí Minh 12 9273.140 0.4916247
## 24 Hà Nội Từ Liêm 7104.665 0.4374471
## 25 Hồ Chí Minh Thủ Đức 10881.460 0.4243820
## 26 Bình Dương Dĩ An 6111.717 0.3764839
## 27 Hà Nội Hà Đông 5997.269 0.3671120
## 28 Hà Nội Tây Hồ 6257.055 0.3611581
## 29 Đà Nẵng Thanh Khê 21657.081 0.3584673
## 30 Hồ Chí Minh 2 3414.600 0.3573184
The districts with the highest risk of infection are located in Hanoi and Ho Chi Minh provinces. There are no longer any districts with low population densities but relatively high risks of infection.
hist2(distROIsf$ROI, n = 50, xlab = "Intracity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 1, 0.1))
axis(2)
Let’s map the risk of infection to the districts:
distROIsf %>%
st_geometry() %>%
plot(lwd = .1, col = pal[cut(distROIsf$ROI, quantile(distROIsf$ROI, seq(0, 1, le = 11)))], main = NA)
vn0 %>%
st_geometry() %>%
plot(add = TRUE)
Let’s examine the relationship between population density and risk of infection:
ggplot(distROIsf, aes(x = log10(den_km2), y = ROI)) +
geom_point() +
geom_point(data = filter(distROIsf, province == "Hà Giang"), aes(x= log10(den_km2), y = ROI), color='red')
Generally, risk of infection increases with population density. Note that the districts in the Hà Giang province no longer have have particularly high ROI for their relatively low population densities as highlighted in red.
Let’s aggregate the colocation data from district-level to province-level: \[ \begin{aligned} R_{0 ij} = R_{0 \text{ Hồ Chí Minh}} \frac{\sqrt{\sum_{k \in P_i} \sum_{l \in P_j} C_{kl} N_l}}{\sqrt{\sum_{k \in P_{HCM}} \sum_{l \in P_{HCM}} C_{kl} N_l}} \end{aligned} \] where \(R_{0 \text{ Hồ Chí Minh}}\) is the reference \(R_0\) assigned a value of 18 and \(P_i\) is the set of districts in province \(i\). The numerator is the square root of the average number of individuals from province \(j\) that colocate with an individual from province \(i\) per unit time, whereas the denominator is the square root of the average number of colocations among individuals in Hồ Chí Minh.
Let’s compute the \(R_{0 ij}\) values with the standard reference province of Hồ Chí Minh assigned an \(R_0\) of 18.
prov_coloc <- reshape2::melt(coloc_mat) %>%
setNames(c("polygon1_id", "polygon2_id", "link_value")) %>%
dplyr::right_join(select(fb_pop, polygon1_id, province), "polygon1_id") %>%
rename(province_1 = province) %>%
dplyr::right_join(select(fb_pop, polygon1_id, province, fb_population_1), c("polygon2_id" = "polygon1_id")) %>%
rename(province_2 = province, fb_population_2 = fb_population_1) %>%
mutate(coloc_num = link_value * fb_population_2) %>%
group_by(province_1, province_2) %>%
summarise(coloc_value = sum(coloc_num))
## `summarise()` regrouping output by 'province_1' (override with `.groups` argument)
ref_prov <- prov_coloc %>%
arrange(desc(coloc_value)) %>%
head(1) %>%
pull(coloc_value)
prov_coloc %<>% mutate(R0 = 18 * sqrt(coloc_value) / sqrt(ref_prov))
The following code summarizes the province-level population and geographical data from GADM/census data set:
prov_polygons <- dist_polygons %>%
group_by(province) %>%
summarise(total_area_km2 = sum(area_km2), total_pop = sum(n), geom = st_union(geometry)) %>%
mutate(total_den_km2 = total_pop / total_area_km2) %>%
select(province, total_area_km2, total_pop, total_den_km2, geom)
## `summarise()` ungrouping output (override with `.groups` argument)
Intracity \(R_0\) is defined as \(R_{0 ii}\).
Let’s see the distribution of the intracity \(R_0\) values; that is, the distribution of \(R_{0 ii}\) values.
intracityR0 <- prov_coloc %>%
filter(province_1 == province_2) %>%
rename(intraR0 = R0) %>%
select(province_1, intraR0) %>%
rename(province = province_1)
hist2(intracityR0$intraR0, n = 50, xlab = "Intracity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 30, 2))
axis(2)
The provinces with the largest intra-province \(R_0\) values are:
intracityR0 %>%
arrange(desc(intraR0)) %>%
head(10)
## # A tibble: 10 x 2
## # Groups: province [10]
## province intraR0
## <chr> <dbl>
## 1 Hồ Chí Minh 18
## 2 Hà Nội 12.9
## 3 Đà Nẵng 5.73
## 4 Hải Phòng 5.42
## 5 Bình Dương 5.12
## 6 Đồng Nai 4.30
## 7 Thanh Hóa 4.25
## 8 Cần Thơ 3.79
## 9 An Giang 3.78
## 10 Bắc Ninh 3.71
Let’s map the intracity \(R_0\) values for the provinces:
intracityR0sf <- prov_polygons %>%
right_join(intracityR0, "province") %>%
select(province, intraR0, everything())
intracityR0sf %>%
st_geometry() %>%
plot(lwd = .1, col = pal[cut(intracityR0sf$intraR0, quantile(intracityR0sf$intraR0, seq(0, 1, le = 11)))], main = NA)
vn0 %>%
st_geometry() %>%
plot(add = TRUE)
Intercity \(R_0\) for location \(i\) is defined as \(\sum_{j \neq i} R_{0 ij}\).
Let’s consider the intercity R0 values:
intercityR0 <- prov_coloc %>%
filter(province_1 != province_2) %>%
dplyr::rename(interR0 = R0)
totalInter <- intercityR0 %>%
group_by(province_1) %>%
summarize(interR0 = sum(interR0)) %>%
arrange(desc(interR0)) %>%
rename(province = province_1)
## `summarise()` ungrouping output (override with `.groups` argument)
Let’s see the distribution of the intracity \(R_0\) values; that is, the distribution of \(R_{0 ii}\) values.
hist2(totalInter$interR0, n = 50, xlab = "Intercity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 30, 2))
axis(2)
Let’s map the intracity \(R_0\) values for the provinces:
intercityR0sf <- prov_polygons %>%
right_join(totalInter, "province") %>%
select(province, interR0, everything())
intercityR0sf %>%
st_geometry() %>%
plot(lwd = .1, col = pal[cut(intercityR0sf$interR0, quantile(intercityR0sf$interR0, seq(0, 1, le = 11)))], main = NA)
vn0 %>%
st_geometry() %>%
plot(add = TRUE)
The districts with high intercity \(R_0\) values are clustered in high density/highly connected regions including Hanoi and Ho Chi Minh, which is what we would expect.
Let’s graph the relationship between population density and intercity \(R_0\) values:
ggplot(intercityR0sf, aes(x = log10(total_den_km2), y = interR0)) + geom_point()
As we would expect, population centers with higher population densities are better connected with surrounding regions resulting in higher intercity \(R_0\) values. Hence, the intercity \(R_0\) values under this model are consistent with epidemiological assumptions.
Let’s graph the relationship between intracity \(R_0\) values and intercity \(R_0\) values:
tmp <- intercityR0sf %>%
right_join(intracityR0, "province")
ggplot(tmp, aes(x = intraR0, y = interR0)) + geom_point() +
geom_abline(slope = 1, intercept = 0)
Note that a large fraction of the inter-province \(R_0\) for the provinces is attributable to connections with Hồ Chí Minh and Hà Nội:
Histogram of the percent inter-province ROI attributable to connections with Hồ Chí Minh and Hà Nội:
intercityR0 %>%
right_join(totalInter, c("province_1" = "province")) %>%
rename(totalInter = interR0.y, interR0 = interR0.x) %>%
mutate(tmp = interR0 / totalInter * 100) %>%
filter(province_2 %in% c("Hồ Chí Minh", "Hà Nội")) %>%
group_by(province_1) %>%
summarise(perTotal = sum(tmp)) %>%
arrange(desc(perTotal)) %>%
pull(perTotal) %>%
hist2(., n = 50, xlab = "Percent interROI from Hồ Chí Minh/Hà Nộ", ylab = "number of districts", axes = FALSE)
## `summarise()` ungrouping output (override with `.groups` argument)
axis(1, seq(0, 100, 10))
axis(2)
Relationship between inter-province \(R_0\) and intra-province \(R_0\) with Hồ Chí Minh and Hà Nội connections removed:
intercityR0sf <- prov_coloc %>%
filter(province_1 != province_2) %>%
dplyr::rename(interR0 = R0) %>%
filter(province_2 %notin% c("Hồ Chí Minh", "Hà Nội")) %>%
group_by(province_1) %>%
summarize(interR0 = sum(interR0)) %>%
arrange(desc(interR0)) %>%
rename(province = province_1)
## `summarise()` ungrouping output (override with `.groups` argument)
tmp <- intercityR0sf %>%
right_join(intracityR0, "province")
ggplot(tmp, aes(x = intraR0, y = interR0)) + geom_point() +
geom_abline(slope = 1, intercept = 0)
Let’s calculate it for all districts:
ROIdf <- prov_coloc %>%
group_by(province_1) %>%
summarise(ROI = sum(R0)) %>%
dplyr::rename(province = province_1) %>%
arrange(desc(ROI)) %>%
mutate(ROI = ROI / head(., 1)$ROI)
## `summarise()` ungrouping output (override with `.groups` argument)
ROIsf <- prov_polygons %>%
right_join(ROIdf, "province")
It look like this:
ROIsf %>%
arrange(desc(ROI)) %>%
select(province, total_den_km2, ROI) %>%
st_drop_geometry() %>%
head(30)
## # A tibble: 30 x 3
## province total_den_km2 ROI
## <chr> <dbl> <dbl>
## 1 Hồ Chí Minh 3825. 1
## 2 Hà Nội 2286. 0.751
## 3 Đà Nẵng 1113. 0.323
## 4 Bình Dương 648. 0.321
## 5 Hải Phòng 1453. 0.317
## 6 Thanh Hóa 337. 0.308
## 7 Long An 367. 0.306
## 8 Đồng Nai 529. 0.284
## 9 Cần Thơ 888. 0.272
## 10 Bắc Ninh 1533. 0.269
## # … with 20 more rows
hist2(ROIsf$ROI, n = 50, xlab = "Intracity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 1, 0.1))
axis(2)
Let’s map the risk of infection to the provinces:
ROIsf %>%
st_geometry() %>%
plot(lwd = .1, col = pal[cut(ROIsf$ROI, quantile(ROIsf$ROI, seq(0, 1, le = 11)))], main = NA)
vn0 %>%
st_geometry() %>%
plot(add = TRUE)
We want to compare the province-level risk of infection with an aggregation of the district-level risk of infection:
First, aggregate the district-level ROI into provinces by taking mean ROI for each province:
tmp <- distROIdf %>%
right_join(select(as.data.frame(dist_polygons), province, polygon_id), "polygon_id") %>%
group_by(province) %>%
summarise(totalROI = sum(ROI)) %>%
arrange(desc(totalROI)) %>%
mutate(adjROI = totalROI / head(., 1)$totalROI)
## `summarise()` ungrouping output (override with `.groups` argument)
Let’s collate the two sets of data:
bothROI <- ROIsf %>%
right_join(select(tmp, province, adjROI), "province") %>%
select(province, total_pop, ROI, adjROI, everything()) %>%
arrange(desc(ROI))
Let’s see how they compare:
bothROI %>%
ggplot(aes(x = ROI, y = adjROI)) +
labs(x = "Province-level ROI", y = "District-adjusted Province ROI") +
geom_point()
They appear generally consistent. It is worth noting, however, that the province-level ROI tend to be higher than the district-adjusted province-level ROI. That is, the risk of infection of the district as a whole is generally considered greater than the sum of the risk of infection of the individual districts that make up the province.